I am using a dataset that contains Beatles song lyrics for songs from 13 albums. I found the dataset on a public GitHub page. I will compare lyrics primarily grouped by album.
Import the Beatles lyrics dataset and remove the two songs ("Flying" and "Revolution 9") that have no lyrics.
import pandas as pd
lyrics = pd.read_excel('beatles_lyrics.xlsx')
lyrics = lyrics[~lyrics['SONG'].isin(["Flying", "Revolution 9"])]
Functions to remove punctuation and stopwords
#function used in NLP class to remove punctuation
def clean_txt(var_in):
    import re
    tmp = re.sub("[^0-9A-Za-z']", " ", var_in).lower()
    return tmp
#function used in NLP class to remove stopwords
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
def rem_sw(var_in):
    sw = set(stopwords.words('english'))  # set gives O(1) membership tests
    tmp = [word for word in var_in.split() if word not in sw]
    return ' '.join(tmp)
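As a quick sanity check of this cleaning pipeline, here is a minimal sketch applied to a sample lyric line. The stopword set below is a tiny hand-picked stand-in for NLTK's much longer English list, so the real pipeline will remove more words.

```python
import re

# Tiny stand-in stopword set; NLTK's English list is far larger.
SAMPLE_SW = {"a", "the", "is", "you", "me", "to", "i"}

def clean_txt(var_in):
    # Keep digits, letters, and apostrophes; everything else becomes a space.
    return re.sub("[^0-9A-Za-z']", " ", var_in).lower()

def rem_sw(var_in, sw=SAMPLE_SW):
    return ' '.join(word for word in var_in.split() if word not in sw)

print(rem_sw(clean_txt("Love, love me do! You know I love you.")))
# -> love love do know love
```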
Clean text and create new column
lyrics["clean_lyrics"] = lyrics.LYRICS.apply(clean_txt).apply(rem_sw)
len(lyrics)
184
The dataset has 184 songs
lyrics.head()
| | ALBUM | SONG | LYRICS | EARLY_LATE | COMPOSER | YEAR | clean_lyrics |
|---|---|---|---|---|---|---|---|
| 0 | A. Please Please Me | A Taste of Honey | A taste of honey... tasting much sweeter than ... | Early | McCartney | 1963 | taste honey tasting much sweeter wine dream fi... |
| 1 | A. Please Please Me | Anna (Go To Him) | Anna, You come and ask me, girl, To set you fr... | Early | Lennon | 1963 | anna come ask girl set free girl say loves set... |
| 2 | A. Please Please Me | Ask Me Why | I love you Can't you tell me things I want to ... | Early | Lennon | 1963 | love can't tell things want know true really g... |
| 3 | A. Please Please Me | Baby It's You | Sha la la la la la la la Sha la la la la la la... | Early | Lennon | 1963 | sha la la la la la la la sha la la la la la la... |
| 4 | A. Please Please Me | Boys | I been told when a boy kiss a girl, Take a tri... | Early | Starr | 1963 | told boy kiss girl take trip around world hey ... |
Lyrics could be grouped by "EARLY_LATE" (whether the song album was from the Beatles early or late period), composer, year, or album. I am going to analyze lyric differences between albums.
Function to get a count-vectorized or tf-idf matrix in dataframe form (converting lyrics to a "bag of words"):
#function used in NLP class
def transform_fun(df_in, col_name_in, m_in, n_in, sw):
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    import pandas as pd
    if sw == "vec":
        my_vec = CountVectorizer(ngram_range=(m_in, n_in))
    elif sw == "tf-idf":
        my_vec = TfidfVectorizer(ngram_range=(m_in, n_in))
    xform_data = pd.DataFrame(my_vec.fit_transform(
        list(df_in[col_name_in])).toarray())
    xform_data.columns = my_vec.get_feature_names_out()
    return xform_data
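For intuition, here is what the "vec" branch produces, sketched with only the standard library on two toy documents: a shared sorted vocabulary and one row of counts per document. CountVectorizer returns the same structure (as a sparse matrix, with its fitted vocabulary in `vocabulary_`).

```python
from collections import Counter

docs = ["love love do", "love know true"]

# Shared vocabulary across all documents, like the vectorizer's fitted vocabulary
vocab = sorted({w for d in docs for w in d.split()})

# One row of counts per document, one column per vocabulary word
matrix = [[Counter(d.split())[w] for w in vocab] for d in docs]

print(vocab)   # ['do', 'know', 'love', 'true']
print(matrix)  # [[1, 0, 2, 0], [0, 1, 1, 1]]
```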
Aggregating words by album and vectorizing
lyrics_album = lyrics.drop(columns = ["SONG","LYRICS","EARLY_LATE","COMPOSER","YEAR"])
lyrics_album = lyrics_album.groupby(['ALBUM'])['clean_lyrics'].apply(' '.join).reset_index()
vectorized_albums = transform_fun(
lyrics_album, "clean_lyrics", 1, 1, "vec")
Dropping some commonly occurring "words" that are really just exclamations or fragments of contractions:
vectorized_albums = vectorized_albums.drop(columns=['ll','ve','la','oh','ev','ah','mm','yeh','oo'])
Transpose the dataframe so that columns are albums, rows are words, and values are word counts:
vectorized_albums.set_index(lyrics_album.ALBUM, inplace = True)
transposed_lyrics = vectorized_albums.transpose()
transposed_lyrics.head()
| ALBUM | A. Please Please Me | B. With The Beatles | C. A Hard Day's Night | D. Beatles For Sale | E. Help! | F. Rubber Soul | G. Revolver | H. Sgt. Pepper's Lonely Hearts Club Band | I. Magical Mystery Tour | J. The Beatles (aka The White Album) | K. Yellow Submarine | L. Abbey Road | M. Let It Be |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| aaaaah | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| aaahhh | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| able | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| abord | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| accidents | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
Top 20 most used words in all albums:
word_counts = vectorized_albums.sum()  # avoid shadowing the built-in sum()
word_counts.name = 'Sum'
topword = word_counts.sort_values(ascending=False).head(20)
topword
love     393
know     245
see      155
girl     155
got      154
want     145
say      145
come     128
like     127
baby     112
get      110
back     110
time     103
tell     101
one      100
never     96
need      95
can       92
man       91
home      87
Name: Sum, dtype: int64
It is certainly no surprise that "love" is by far the most commonly occurring word across all albums. This is just the raw count of occurrences (I didn't do anything to collapse repeated occurrences of a word within a song or album), so I suspect a lot of this comes from "All You Need Is Love," in which the word "love" is repeated many times. We can test this theory by breaking the top words down by album.
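The distinction matters: a raw count can be dominated by a single repetitive chorus, while a per-song presence count cannot. A standard-library sketch on a hypothetical mini-corpus:

```python
from collections import Counter

# Hypothetical mini-corpus of three "songs"
songs = ["love love love love love", "love you", "yesterday"]

raw = Counter(w for s in songs for w in s.split())             # every occurrence
presence = Counter(w for s in songs for w in set(s.split()))   # at most once per song

print(raw["love"])       # 6 -> inflated by the repetitive first song
print(presence["love"])  # 2 -> number of songs containing "love"
```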
I would assume that there would be wide variation in word occurrences and most popular words between albums. The Beatles are known for moving from more generic or traditional lyrics and songwriting in their early years to avant-garde and complex lyrics on their later albums, so I'm thinking later albums will have more unusual popular words and earlier albums will have more common ones. Let's chart the 10 most popular words in each album and compare:
album_abbrev = ["ppm","wtb","hdn","bfs","h","rs","rev","sgt","mmt","wa","ys","ar","lib"]
counter = 0
for album in transposed_lyrics.columns:
    album_abbrev[counter] = transposed_lyrics[[album]].sort_values(by=[album], ascending=False).head(10)
    album_abbrev[counter]["word"] = album_abbrev[counter].index
    counter += 1
#@title Most Popular Words in Each Album
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
fig = make_subplots(rows=4, cols=4,
subplot_titles= ["Please Please Me","With the Beatles","A Hard Day's Night",
"Beatles For Sale","Help!","Rubber Soul","Revolver",
"Sgt. Pepper","Magical Mystery Tour",
"White Album","Yellow Submarine","Abbey Road","Let it Be"])
fig.add_trace(
go.Bar(x=album_abbrev[0]['word'],y=album_abbrev[0]["A. Please Please Me"], name = "Please Please Me"),
row=1, col=1
)
fig.add_trace(
go.Bar(x=album_abbrev[1]['word'],y=album_abbrev[1]["B. With The Beatles"], name = "With the Beatles"),
row=1, col=2
)
fig.add_trace(
go.Bar(x=album_abbrev[2]['word'],y=album_abbrev[2]["C. A Hard Day's Night"], name ="A Hard Day's Night"),
row=1, col=3
)
fig.add_trace(
go.Bar(x=album_abbrev[3]['word'],y=album_abbrev[3]["D. Beatles For Sale"], name = "Beatles For Sale"),
row=1, col=4
)
fig.add_trace(
go.Bar(x=album_abbrev[4]['word'],y=album_abbrev[4]["E. Help!"], name ="Help!"),
row=2, col=1
)
fig.add_trace(
go.Bar(x=album_abbrev[5]['word'],y=album_abbrev[5]["F. Rubber Soul"], name = "Rubber Soul"),
row=2, col=2
)
fig.add_trace(
go.Bar(x=album_abbrev[6]['word'],y=album_abbrev[6]["G. Revolver"], name ="Revolver"),
row=2, col=3
)
fig.add_trace(
go.Bar(x=album_abbrev[7]['word'],y=album_abbrev[7]["H. Sgt. Pepper's Lonely Hearts Club Band"], name = "Sgt. Pepper"),
row=2, col=4
)
fig.add_trace(
go.Bar(x=album_abbrev[8]['word'],y=album_abbrev[8]["I. Magical Mystery Tour"], name ="Magical Mystery Tour"),
row=3, col=1
)
fig.add_trace(
go.Bar(x=album_abbrev[9]['word'],y=album_abbrev[9]["J. The Beatles (aka The White Album)"], name = "White Album"),
row=3, col=2
)
fig.add_trace(
go.Bar(x=album_abbrev[10]['word'],y=album_abbrev[10]["K. Yellow Submarine"], name ="Yellow Submarine"),
row=3, col=3
)
fig.add_trace(
go.Bar(x=album_abbrev[11]['word'],y=album_abbrev[11]["L. Abbey Road"], name = "Abbey Road"),
row=3, col=4
)
fig.add_trace(
go.Bar(x=album_abbrev[12]['word'],y=album_abbrev[12]["M. Let It Be"], name = "Let It Be"),
row=4, col=1
)
fig.update_layout(height=1000, width=1000, title_text="Top 10 Words for Each Album", showlegend=False)
fig.show()
Actually, the word "love" shows up the most in A Hard Day's Night, likely from the song "Can't Buy Me Love." For many albums, it looks like the top words come mostly from one song with a lot of repetition in its chorus. Some albums repeat their top words far more often than others: A Hard Day's Night has "love" repeated over 80 times, and Please Please Me has it repeated almost 60 times. On the flip side, the most-used words on Revolver, Sgt. Pepper, and Abbey Road all appear fewer than 30 times.
Very interestingly, this Rolling Stone poll of top Beatles albums put Revolver, Sgt. Pepper, and Abbey Road as readers' highest-ranked Beatles albums - the three albums with the least-repeated top words. Perhaps fans prefer albums whose songs have more diverse lyrics.
On the whole, later albums do appear to have their top words repeated fewer times than earlier albums. It's hard to say for sure whether the top words in later albums are more unusual than those in earlier albums, because some pretty weird words come up in earlier albums ("shuop"). It would be interesting to do a part-of-speech analysis on this in the future.
Another interesting thing to note is the change in popularity of the word "girl" across the albums. "Girl" was highly ranked on Please Please Me and Beatles For Sale, then ranked first on the following two albums, Help! and Rubber Soul. After Rubber Soul, "girl" never made it into the top words for any album again. This coincides with the Beatles' switch from writing albums and touring to experimenting and innovating in the studio. They were no longer interested in playing directly for fans, many of them young girls.
Another way to compare word popularity across albums is with tf-idf rather than a simple count. Tf-idf penalizes words that are common across all albums, so using tf-idf rather than count highlights the words that are most uniquely popular in each album. I'm limiting the analysis to the top 100 words so that the results are more easily interpretable.
#I'm altering the function to include a parameter "max_in", which limits
#the number of words the tf-idf vectorizer keeps.
def transform_fun_top(df_in, col_name_in, m_in, n_in, sw, max_in):
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    import pandas as pd
    if sw == "vec":
        my_vec = CountVectorizer(ngram_range=(m_in, n_in))
    elif sw == "tf-idf":
        my_vec = TfidfVectorizer(ngram_range=(m_in, n_in), max_features=max_in)
    xform_data = pd.DataFrame(my_vec.fit_transform(
        list(df_in[col_name_in])).toarray())
    xform_data.columns = my_vec.get_feature_names_out()
    return xform_data
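To see why tf-idf damps album-spanning words like "love," here is scikit-learn's smoothed idf computed by hand for this 13-album corpus (this is the formula TfidfVectorizer uses with its default `smooth_idf=True`): a word appearing in every album gets the minimum idf of 1, while a word unique to one album gets a much larger weight.

```python
import math

n_albums = 13

def smoothed_idf(doc_freq, n_docs=n_albums):
    # scikit-learn's default idf: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

print(round(smoothed_idf(13), 3))  # word in every album -> 1.0
print(round(smoothed_idf(1), 3))   # word in one album   -> 2.946
```

The tf-idf score for a word in an album is its count times this idf, with each album row then L2-normalized.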
tfidf_albums = transform_fun_top(
lyrics_album, "clean_lyrics", 1, 1, "tf-idf",100)
tfidf_albums = tfidf_albums.drop(columns=['ll','ve','la','oh','ev','ah','mm','yeh'])
tfidf_albums.set_index(lyrics_album.ALBUM, inplace = True)
tfidf_lyrics = tfidf_albums.transpose()
tfidf_lyrics["word"] = tfidf_lyrics.index
Ordering the top-20 overall words by their tf-idf in the Beatles' first album:
tfidf_lyrics[tfidf_lyrics["word"].isin(topword.index)].sort_values(by = ["A. Please Please Me"], ascending=False)
| ALBUM | A. Please Please Me | B. With The Beatles | C. A Hard Day's Night | D. Beatles For Sale | E. Help! | F. Rubber Soul | G. Revolver | H. Sgt. Pepper's Lonely Hearts Club Band | I. Magical Mystery Tour | J. The Beatles (aka The White Album) | K. Yellow Submarine | L. Abbey Road | M. Let It Be | word |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| love | 0.328369 | 0.079971 | 0.542662 | 0.339799 | 0.092957 | 0.297787 | 0.186552 | 0.185758 | 0.216589 | 0.147381 | 0.395276 | 0.103166 | 0.057509 | love |
| baby | 0.200204 | 0.052937 | 0.007924 | 0.239628 | 0.041205 | 0.140194 | 0.000000 | 0.019213 | 0.118598 | 0.091462 | 0.000000 | 0.000000 | 0.059481 | baby |
| know | 0.190108 | 0.085302 | 0.070227 | 0.115840 | 0.059758 | 0.236176 | 0.168785 | 0.077399 | 0.216589 | 0.311767 | 0.056468 | 0.193436 | 0.038339 | know |
| come | 0.160882 | 0.097350 | 0.061716 | 0.041475 | 0.071318 | 0.066177 | 0.028625 | 0.033254 | 0.020527 | 0.213100 | 0.000000 | 0.096960 | 0.051475 | come |
| girl | 0.119680 | 0.018460 | 0.066316 | 0.115871 | 0.321859 | 0.450354 | 0.000000 | 0.053598 | 0.007352 | 0.098134 | 0.000000 | 0.074418 | 0.088498 | girl |
| never | 0.099004 | 0.085897 | 0.027430 | 0.033180 | 0.142637 | 0.022059 | 0.076334 | 0.083135 | 0.034212 | 0.018266 | 0.000000 | 0.152366 | 0.030885 | never |
| say | 0.086629 | 0.000000 | 0.123433 | 0.049770 | 0.035659 | 0.242649 | 0.028625 | 0.116389 | 0.287378 | 0.079151 | 0.040435 | 0.138514 | 0.010295 | say |
| can | 0.057609 | 0.042651 | 0.236218 | 0.030891 | 0.033199 | 0.051343 | 0.053300 | 0.030960 | 0.031851 | 0.022674 | 0.037645 | 0.012896 | 0.009585 | can |
| like | 0.057609 | 0.074639 | 0.057458 | 0.054059 | 0.132796 | 0.092417 | 0.071067 | 0.123839 | 0.038222 | 0.062353 | 0.047057 | 0.116062 | 0.105433 | like |
| want | 0.049502 | 0.274871 | 0.041144 | 0.016590 | 0.057055 | 0.110295 | 0.104959 | 0.049881 | 0.006842 | 0.066974 | 0.000000 | 0.373988 | 0.102951 | want |
| back | 0.046542 | 0.024613 | 0.036842 | 0.071305 | 0.130276 | 0.094811 | 0.000000 | 0.035732 | 0.029409 | 0.124304 | 0.000000 | 0.104185 | 0.320806 | back |
| got | 0.046087 | 0.111959 | 0.127685 | 0.100395 | 0.146075 | 0.051343 | 0.088834 | 0.092879 | 0.019111 | 0.045348 | 0.009411 | 0.180540 | 0.220450 | got |
| see | 0.046087 | 0.063977 | 0.051074 | 0.023168 | 0.139436 | 0.236176 | 0.168785 | 0.108359 | 0.101924 | 0.136044 | 0.028234 | 0.103166 | 0.028754 | see |
| tell | 0.039893 | 0.055379 | 0.154736 | 0.089132 | 0.084296 | 0.071109 | 0.143539 | 0.017866 | 0.007352 | 0.091592 | 0.000000 | 0.119069 | 0.000000 | tell |
| one | 0.034565 | 0.042651 | 0.038306 | 0.100395 | 0.026559 | 0.030806 | 0.124368 | 0.077399 | 0.031851 | 0.045348 | 0.047057 | 0.167645 | 0.095848 | one |
| time | 0.017283 | 0.031988 | 0.051074 | 0.100395 | 0.073038 | 0.143759 | 0.115484 | 0.154798 | 0.031851 | 0.045348 | 0.037645 | 0.038687 | 0.047924 | time |
| home | 0.013298 | 0.086145 | 0.162105 | 0.044566 | 0.000000 | 0.035554 | 0.010253 | 0.160794 | 0.000000 | 0.058881 | 0.010862 | 0.044651 | 0.199121 | home |
| need | 0.006188 | 0.011453 | 0.048002 | 0.058065 | 0.064187 | 0.044118 | 0.104959 | 0.116389 | 0.130005 | 0.024354 | 0.192067 | 0.069257 | 0.000000 | need |
| get | 0.000000 | 0.040085 | 0.116576 | 0.033180 | 0.057055 | 0.044118 | 0.095417 | 0.182897 | 0.041054 | 0.054797 | 0.010109 | 0.096960 | 0.267672 | get |
| man | 0.000000 | 0.125982 | 0.006857 | 0.024885 | 0.021396 | 0.187502 | 0.047709 | 0.083135 | 0.130005 | 0.036531 | 0.020218 | 0.083108 | 0.020590 | man |
By plotting word tf-idf for one album on one axis and another album on another axis, we can get a sense of which popular words are shared between albums and which words are unique to an album.
With the Beatles vs. Yellow Submarine
import plotly.express as px
fig = px.scatter(tfidf_lyrics, x = "B. With The Beatles", y = "K. Yellow Submarine", hover_name = "word", text = "word", opacity = 0)
fig.update_layout(height=600, width=1000)
fig.show()
Not very many words are in the middle of the graph, indicating that these two albums are not very similar in their vocabularies.
An interesting thing to note is that With the Beatles has a high-ish tf-idf for "wanna" and "want" while Yellow Submarine has a high-ish tf-idf for "need". A similar kind of meaning, but different words with different feelings.
Rubber Soul vs. Please Please Me
import plotly.express as px
fig = px.scatter(tfidf_lyrics, x = "F. Rubber Soul", y = "A. Please Please Me", hover_name = "word", text = "word", opacity = 0)
fig.update_layout(height=600, width=1000)
fig.show()
There are many more words in the center of the graph for these two albums than for the previous pair. We can see that these two albums are more similar in their vocabularies.
We can assess word-frequency correlations more formally with a correlation table, graphed as a heatmap. I'm going to compare correlations in word count for all words in the corpus.
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(16, 6))
heatmap = sns.heatmap(transposed_lyrics.corr(), annot=True, cmap=sns.color_palette("coolwarm", 6))
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':18}, pad=12);
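Under the hood, `DataFrame.corr()` computes the Pearson correlation for every pair of album columns over the shared word axis. A hand-rolled version on hypothetical counts for three words in two albums shows the idea: near-proportional count vectors score close to 1.

```python
def pearson(xs, ys):
    # Pearson correlation: covariance divided by the product of standard deviations
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical counts for three words in two albums
print(round(pearson([10, 0, 5], [8, 1, 4]), 3))  # 0.997 -> very similar vocabularies
```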
There are a couple of albums that stand out as more correlated with other albums, meaning they have more words in common and are potentially less unique in their lyrics. Please Please Me stands out as having higher correlations with several other albums than most albums have with each other. Rubber Soul also shows a good number of higher correlations. Perhaps these two albums have more generic lyrics when considered against the entire Beatles catalogue.
Conversely, Yellow Submarine has notably lower correlations with many albums than other albums have with each other. This makes sense to me in light of the previous analysis that showed that two of Yellow Submarine's most popular words are "yellow" and "submarine." Clearly, those words are unlikely to be popular in other albums, so that would decrease the album's word correlation with other albums.
Next I want to take a look at relative frequencies of words used in songs credited only to John Lennon vs. songs credited only to Paul McCartney. I'm omitting any Lennon-McCartney songs.
composer = lyrics.drop(columns = ["SONG","LYRICS","EARLY_LATE","ALBUM","YEAR"])
composer.groupby(['COMPOSER']).size().nlargest(10)
COMPOSER
Lennon                             65
McCartney                          59
Harrison                           26
Lennon/McCartney                   19
Starr                              11
Lennon/Harrison                     2
Lennon/McCartney/Harrison           1
Lennon/McCartney/Harrison/Starr     1
dtype: int64
There are roughly the same number of songs composed solely by John Lennon (65) as by Paul McCartney (59) in the dataset.
composer = composer.groupby(['COMPOSER'])['clean_lyrics'].apply(' '.join).reset_index()
composer_lenmac = composer[composer["COMPOSER"].isin(["Lennon","McCartney"])]
composer_lenmac
| | COMPOSER | clean_lyrics |
|---|---|---|
| 1 | Lennon | anna come ask girl set free girl say loves set... |
| 6 | McCartney | taste honey tasting much sweeter wine dream fi... |
Creating a bag of words broken down by composer (Lennon or McCartney):
vec_lenmac = transform_fun(
composer_lenmac, "clean_lyrics", 1, 1, "vec")
vec_lenmac = vec_lenmac.drop(columns=['ll','ve','la','oh','ev','ah','mm','yeh'])
vec_lenmac.set_index(composer_lenmac.COMPOSER, inplace = True)
composer_lyrics = vec_lenmac.transpose()
composer_lyrics.head()
| COMPOSER | Lennon | McCartney |
|---|---|---|
| aaaaah | 0 | 1 |
| able | 0 | 1 |
| accidents | 1 | 0 |
| aches | 0 | 2 |
| across | 4 | 3 |
We can sort the dataframe to see top words for each:
composer_lyrics.sort_values(by = ["McCartney"], ascending=False).head(5)
| COMPOSER | Lennon | McCartney |
|---|---|---|
| love | 142 | 91 |
| back | 18 | 72 |
| say | 39 | 65 |
| know | 112 | 60 |
| get | 29 | 58 |
composer_lyrics.sort_values(by = ["Lennon"], ascending=False).head(5)
| COMPOSER | Lennon | McCartney |
|---|---|---|
| love | 142 | 91 |
| know | 112 | 60 |
| girl | 94 | 35 |
| want | 89 | 21 |
| got | 70 | 24 |
Generating word frequencies for each
sum_composer = composer_lyrics.sum()
for x in composer_lyrics.columns:
    text = x + " frequencies"
    composer_lyrics[text] = composer_lyrics[x] / sum_composer[x]  # label-based lookup
    print(x)
Lennon
McCartney
Getting relative frequencies:
composer_lyrics["Lennon vs. McCartney"] = composer_lyrics["Lennon frequencies"] - composer_lyrics["McCartney frequencies"]
composer_lyrics["word"] = composer_lyrics.index
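Spelled out with hypothetical numbers, the "Lennon vs. McCartney" score is just a difference of per-composer rates, so it is positive for Lennon-leaning words and negative for McCartney-leaning ones. The counts and corpus totals below are illustrative, not the dataset's actual values.

```python
# Hypothetical counts of one word, and total cleaned-word counts per composer
lennon_count, lennon_total = 94, 5300
mccartney_count, mccartney_total = 35, 4500

# Relative-frequency difference: positive -> the word leans Lennon
score = lennon_count / lennon_total - mccartney_count / mccartney_total
print(round(score, 4))
```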
Part of speech tagging:
nltk.download('averaged_perceptron_tagger')
a = nltk.pos_tag(composer_lyrics["word"])
b = [el[1] for el in a]
composer_lyrics["POS"] = b
Getting Vader sentiment score for each word:
%%capture
%pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sid_obj = SentimentIntensityAnalyzer()
sent = composer_lyrics["word"].apply(sid_obj.polarity_scores)
c = [el['compound'] for el in sent]
composer_lyrics["vader_sentiment"] = c
We can sort to see the most Lennon-like and McCartney-like words:
composer_lyrics.sort_values(by = ["Lennon vs. McCartney"], ascending=False).head(10)
| COMPOSER | Lennon | McCartney | Lennon frequencies | McCartney frequencies | Lennon vs. McCartney | word | POS | vader_sentiment |
|---|---|---|---|---|---|---|---|---|
| want | 89 | 21 | 0.016664 | 0.004696 | 0.011968 | want | VBP | 0.0772 |
| girl | 94 | 35 | 0.017600 | 0.007826 | 0.009773 | girl | NN | 0.0000 |
| got | 70 | 24 | 0.013106 | 0.005367 | 0.007739 | got | VBD | 0.0000 |
| know | 112 | 60 | 0.020970 | 0.013417 | 0.007553 | know | VBP | 0.0000 |
| nothing | 44 | 4 | 0.008238 | 0.000894 | 0.007344 | nothing | NN | 0.0000 |
| going | 52 | 12 | 0.009736 | 0.002683 | 0.007053 | going | VBG | 0.0000 |
| cry | 42 | 5 | 0.007864 | 0.001118 | 0.006746 | cry | NN | -0.4767 |
| that | 44 | 8 | 0.008238 | 0.001789 | 0.006449 | that | WDT | 0.0000 |
| come | 59 | 21 | 0.011047 | 0.004696 | 0.006351 | come | VBN | 0.0000 |
| lose | 36 | 2 | 0.006740 | 0.000447 | 0.006293 | lose | VB | -0.4019 |
composer_lyrics.sort_values(by = ["Lennon vs. McCartney"], ascending=True).head(10)
| COMPOSER | Lennon | McCartney | Lennon frequencies | McCartney frequencies | Lennon vs. McCartney | word | POS | vader_sentiment |
|---|---|---|---|---|---|---|---|---|
| back | 18 | 72 | 0.003370 | 0.016100 | -0.012730 | back | RB | 0.0 |
| bye | 0 | 45 | 0.000000 | 0.010063 | -0.010063 | bye | VBP | 0.0 |
| hello | 0 | 40 | 0.000000 | 0.008945 | -0.008945 | hello | NN | 0.0 |
| get | 29 | 58 | 0.005430 | 0.012970 | -0.007540 | get | VBP | 0.0 |
| say | 39 | 65 | 0.007302 | 0.014535 | -0.007233 | say | VBP | 0.0 |
| together | 7 | 33 | 0.001311 | 0.007379 | -0.006069 | together | RB | 0.0 |
| day | 13 | 38 | 0.002434 | 0.008497 | -0.006063 | day | NN | 0.0 |
| night | 5 | 30 | 0.000936 | 0.006708 | -0.005772 | night | NN | 0.0 |
| never | 25 | 46 | 0.004681 | 0.010286 | -0.005605 | never | RB | 0.0 |
| honey | 1 | 24 | 0.000187 | 0.005367 | -0.005179 | honey | NN | 0.0 |
Overall, the most Lennon-like words are slightly more negative-leaning than the most McCartney-like words. None of the most McCartney-like words are marked as having any sentiment score using the vader sentiment package.
I see a couple differences in part of speech patterns between the top Lennon-like and McCartney-like words. There are twice as many verbs in the Lennon-like words. There are 3 adverbs in the McCartney-like word list, and no adverbs in the Lennon-like word list.
Subjectively, the Lennon-like words thematically seem to suggest more angst and longing while the McCartney-like words are more peppy or active (maybe?). This fits with the public perception of their personalities.
Lastly, I'm going to run an LDA on the albums to see if I can pull out any distinct topics from the lyrics in each album. This analysis will show which topics are predominant for which album. We will also be able to see which words contribute most to each topic.
Because I'm not going to get too involved and use a grid search, I have to choose how many topics I want the LDA to fit the data to. If I were seriously trying to optimize the fit, I would use a grid search to find the best-fitting number of topics.
For now, I am choosing just 2 topics for the LDA. This is based on my theory that there are potentially two major themes in the lyrics of the Beatles' albums - mostly the difference between early and late Beatles.
from sklearn.decomposition import LatentDirichletAllocation
#Code based off of this article: https://yanlinc.medium.com/how-to-build-a-lda-topic-model-using-from-text-601cdcbfd3a6
# Build LDA Model
lda_model = LatentDirichletAllocation(
    n_components=2,          # number of topics
    max_iter=200,            # max learning iterations
    learning_method='online',
    random_state=100,
    batch_size=13,           # n docs in each learning iteration
    evaluate_every=-1,       # compute perplexity every n iters; -1 = don't
    n_jobs=-1,               # use all available CPUs
)
lda_output = lda_model.fit_transform(vectorized_albums)
print(lda_model) # Model attributes
LatentDirichletAllocation(batch_size=13, learning_method='online', max_iter=200,
n_components=2, n_jobs=-1, random_state=100)
# Log likelihood: higher is better
print("Log Likelihood: ", lda_model.score(vectorized_albums))
# Perplexity: Lower the better. Perplexity = exp(-1. * log-likelihood per word)
print("Perplexity: ", lda_model.perplexity(vectorized_albums))
# See model parameters
print(lda_model.get_params())
Log Likelihood: -95776.56263815098
Perplexity: 796.6192469621471
{'batch_size': 13, 'doc_topic_prior': None, 'evaluate_every': -1, 'learning_decay': 0.7, 'learning_method': 'online', 'learning_offset': 10.0, 'max_doc_update_iter': 100, 'max_iter': 200, 'mean_change_tol': 0.001, 'n_components': 2, 'n_jobs': -1, 'perp_tol': 0.1, 'random_state': 100, 'topic_word_prior': None, 'total_samples': 1000000.0, 'verbose': 0}
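The two reported numbers are linked: as the comment above notes, scikit-learn defines perplexity as exp(-log-likelihood per word), so the implied corpus word total can be back-solved from them. This is only a consistency check on the printed values, not a new measurement.

```python
import math

# Reported values from the fitted model above
log_likelihood = -95776.56263815098
perplexity = 796.6192469621471

# perplexity = exp(-log_likelihood / n_words)  =>  solve for n_words
n_words = -log_likelihood / math.log(perplexity)
print(round(n_words))  # implied total token count in the album corpus
```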
See the dominant topic for each album:
import numpy as np
# Create Document — Topic Matrix
lda_output = lda_model.transform(vectorized_albums)
# column names
topicnames = ['Topic' + str(i) for i in range(lda_model.n_components)]
# index names
docnames = vectorized_albums.index
# Make the pandas dataframe
df_document_topic = pd.DataFrame(np.round(lda_output, 4), columns=topicnames, index=docnames)
df_document_topic
| Topic0 | Topic1 | |
|---|---|---|
| ALBUM | ||
| A. Please Please Me | 0.0010 | 0.9990 |
| B. With The Beatles | 0.0008 | 0.9992 |
| C. A Hard Day's Night | 0.9991 | 0.0009 |
| D. Beatles For Sale | 0.9989 | 0.0011 |
| E. Help! | 0.9993 | 0.0007 |
| F. Rubber Soul | 0.9991 | 0.0009 |
| G. Revolver | 0.0011 | 0.9989 |
| H. Sgt. Pepper's Lonely Hearts Club Band | 0.9994 | 0.0006 |
| I. Magical Mystery Tour | 0.0007 | 0.9993 |
| J. The Beatles (aka The White Album) | 0.9997 | 0.0003 |
| K. Yellow Submarine | 0.0014 | 0.9986 |
| L. Abbey Road | 0.0008 | 0.9992 |
| M. Let It Be | 0.9993 | 0.0007 |
# Topic-Keyword Matrix
df_topic_keywords = pd.DataFrame(lda_model.components_)
# Assign Column and Index
df_topic_keywords.columns = vectorized_albums.columns
df_topic_keywords.index = topicnames
# View
df_topic_keywords.head()
| aaaaah | aaahhh | able | abord | accidents | aches | across | act | acts | admit | ... | years | yellow | yes | yesterday | yet | young | younger | zapped | zoo | zoom | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Topic0 | 0.500041 | 0.500081 | 1.499954 | 0.500084 | 1.499949 | 0.500088 | 8.499658 | 7.502298 | 1.499882 | 7.499661 | ... | 10.500334 | 1.479773 | 69.509745 | 1.499918 | 2.500449 | 5.499770 | 4.499806 | 1.499949 | 0.500049 | 0.500070 |
| Topic1 | 1.499963 | 2.499882 | 0.500046 | 2.499891 | 0.500048 | 2.499874 | 0.500073 | 1.497442 | 0.500117 | 0.500121 | ... | 7.499069 | 51.518341 | 13.487255 | 0.500079 | 2.499438 | 0.500086 | 0.500084 | 0.500045 | 2.499911 | 1.499925 |
2 rows × 2114 columns
See the most important words for each topic:
# Show top n keywords for each topic
def show_topics(lda_model=lda_model, n_words=20):
    keywords = np.array(vectorized_albums.columns)
    topic_keywords = []
    for topic_weights in lda_model.components_:
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    return topic_keywords
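The one subtle line in `show_topics` is the negated argsort: sorting indices by `-weights` puts the heaviest topic words first. The same idea with plain Python lists:

```python
weights = [0.1, 3.0, 0.5, 2.0]

# Indices ordered by descending weight, like (-topic_weights).argsort()
order = sorted(range(len(weights)), key=lambda i: -weights[i])

print(order[:2])  # [1, 3] -> positions of the two largest weights
```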
topic_keywords = show_topics(lda_model=lda_model, n_words=20)
# Topic - Keywords Dataframe
df_topic_keywords = pd.DataFrame(topic_keywords)
df_topic_keywords.columns = ['Word '+str(i) for i in range(df_topic_keywords.shape[1])]
df_topic_keywords.index = ['Topic '+str(i) for i in range(df_topic_keywords.shape[0])]
df_topic_keywords
| Word 0 | Word 1 | Word 2 | Word 3 | Word 4 | Word 5 | Word 6 | Word 7 | Word 8 | Word 9 | Word 10 | Word 11 | Word 12 | Word 13 | Word 14 | Word 15 | Word 16 | Word 17 | Word 18 | Word 19 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Topic 0 | love | girl | know | got | see | back | get | like | come | say | yes | time | home | going | tell | baby | yeah | can | let | night |
| Topic 1 | love | know | want | say | see | long | need | got | come | never | man | like | yellow | baby | one | submarine | please | well | day | hello |
There are a similar number of albums falling into the two topics. It isn't a clean split between early and late albums, but the first topic seems to lean earlier than the second topic.
Looking at the most important words for each topic, they appear similar in a lot of ways, with "love" the top word for both and "know," "see," and "come" also important for both.
The main difference is that the first topic has girl as the second most important word, while the second topic doesn't have girl show up at all in the top 20 most important words. That is interesting in light of other analyses in this workbook that have uncovered patterns in the popularity and usage of "girl". We saw that "girl" was highly ranked for earlier albums and didn't show up as a top word again after Rubber Soul. We also saw that "girl" was a highly-ranked Lennon-like word.
I'm going to quickly chart the number of Lennon vs. McCartney composed songs in each album:
#@title Lennon Vs. McCartney Song Number over Albums
composer_album = lyrics.drop(columns = ["LYRICS","EARLY_LATE","YEAR","clean_lyrics"])
composer_album = composer_album[composer_album["COMPOSER"].isin(["Lennon","McCartney"])].groupby(['ALBUM','COMPOSER'])['SONG'].count().reset_index()
fig = px.line(composer_album, x="ALBUM", y="SONG", color='COMPOSER',markers=True)
fig.update_layout(height=400, width=1000)
fig.show()
More songs on the earlier albums were composed solely by John Lennon. Later albums, for the most part, had more songs composed by McCartney. This pattern might help explain why an album would be dominantly topic 0 or topic 1. Perhaps albums with more Lennon songs would have more topic 0, while albums with more McCartney songs would have more topic 1. We can check that:
m = (composer_album.groupby(['ALBUM'])['SONG'].max().reset_index()
     .merge(composer_album, on=['ALBUM', 'SONG'], how='left')
     .groupby(['ALBUM', 'SONG'])['COMPOSER'].apply(','.join).reset_index()
     .rename(columns={"COMPOSER": "main_composer"}))
m.merge(df_document_topic, on = ['ALBUM'], how = 'left')
| | ALBUM | SONG | main_composer | Topic0 | Topic1 |
|---|---|---|---|---|---|
| 0 | A. Please Please Me | 5 | Lennon | 0.0010 | 0.9990 |
| 1 | B. With The Beatles | 5 | Lennon | 0.0008 | 0.9992 |
| 2 | C. A Hard Day's Night | 7 | Lennon | 0.9991 | 0.0009 |
| 3 | D. Beatles For Sale | 5 | Lennon | 0.9989 | 0.0011 |
| 4 | E. Help! | 6 | Lennon | 0.9993 | 0.0007 |
| 5 | F. Rubber Soul | 5 | Lennon | 0.9991 | 0.0009 |
| 6 | G. Revolver | 5 | Lennon,McCartney | 0.0011 | 0.9989 |
| 7 | H. Sgt. Pepper's Lonely Hearts Club Band | 6 | McCartney | 0.9994 | 0.0006 |
| 8 | I. Magical Mystery Tour | 5 | McCartney | 0.0007 | 0.9993 |
| 9 | J. The Beatles (aka The White Album) | 11 | Lennon,McCartney | 0.9997 | 0.0003 |
| 10 | K. Yellow Submarine | 2 | Lennon | 0.0014 | 0.9986 |
| 11 | L. Abbey Road | 8 | McCartney | 0.0008 | 0.9992 |
| 12 | M. Let It Be | 3 | Lennon,McCartney | 0.9993 | 0.0007 |
Two of the three McCartney-dominated albums are categorized as Topic 1, and four of the seven Lennon-dominated albums are categorized as Topic 0. Maybe there is some small effect here, with McCartney-dominated albums leaning toward topic 1 and Lennon-dominated albums leaning toward topic 0.
Now I'm curious to look on a song level to see which composers have more songs from each topic:
Vectorizing by song:
lyrics_song = lyrics.drop(columns = ["ALBUM","LYRICS","EARLY_LATE","COMPOSER","YEAR"])
lyrics_song = lyrics_song.groupby(['SONG'])['clean_lyrics'].apply(' '.join).reset_index()
vectorized_songs = transform_fun(
lyrics_song, "clean_lyrics", 1, 1, "vec")
vectorized_songs = vectorized_songs.drop(columns=['ll','ve','la','oh','ev','ah','mm','yeh','oo'])
vectorized_songs.set_index(lyrics_song.SONG, inplace = True)
vectorized_songs.head()
| aaaaah | aaahhh | able | abord | accidents | aches | across | act | acts | admit | ... | years | yellow | yes | yesterday | yet | young | younger | zapped | zoo | zoom | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SONG | |||||||||||||||||||||
| A Day In The Life | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A Hard Day's Night | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A Taste of Honey | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Across The Universe | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Act Naturally | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 2114 columns
Applying LDA topics to songs:
lda_output = lda_model.transform(vectorized_songs)
topicnames = ['Topic' + str(i) for i in range(lda_model.n_components)]
docnames = vectorized_songs.index
df_document_topic = pd.DataFrame(np.round(lda_output, 4), columns=topicnames, index=docnames)
df_document_topic.head()
| Topic0 | Topic1 | |
|---|---|---|
| SONG | ||
| A Day In The Life | 0.9940 | 0.0060 |
| A Hard Day's Night | 0.9930 | 0.0070 |
| A Taste of Honey | 0.1449 | 0.8551 |
| Across The Universe | 0.9954 | 0.0046 |
| Act Naturally | 0.9930 | 0.0070 |
song_composer = lyrics.drop(columns = ["ALBUM","LYRICS","EARLY_LATE","YEAR","clean_lyrics"])
song_topic_composer = df_document_topic.merge(song_composer, on='SONG', how='left')
song_topic_composer.head()
| | SONG | Topic0 | Topic1 | COMPOSER |
|---|---|---|---|---|
| 0 | A Day In The Life | 0.9940 | 0.0060 | Lennon/McCartney |
| 1 | A Hard Day's Night | 0.9930 | 0.0070 | Lennon/McCartney |
| 2 | A Taste of Honey | 0.1449 | 0.8551 | McCartney |
| 3 | Across The Universe | 0.9954 | 0.0046 | Lennon |
| 4 | Act Naturally | 0.9930 | 0.0070 | Starr |
#@title Composer topic percentages
composer_topic_avg = song_topic_composer.groupby(['COMPOSER'])[['Topic0', 'Topic1']].mean().reset_index()
df = pd.wide_to_long(composer_topic_avg, stubnames = 'Topic', i=['COMPOSER'], j = 'topic_number').reset_index()
fig = px.bar(df, x='COMPOSER', y='Topic', color = df['topic_number'].astype(object))
fig.add_hline(y=0.5)
fig.update_layout(height=400, width=1000,legend_title_text='T')
fig.show()
Now considering all composers, we see that George Harrison is the only composer whose songs average more topic 1 than topic 0. Paul McCartney and John Lennon have very similar topic shares, though their jointly credited Lennon/McCartney songs lean notably more toward topic 0.
This somewhat undercuts my theory that McCartney songs would be more heavily topic 1 while Lennon songs would be more heavily topic 0.